Statistical and constraint-based taggers for French
Abstract
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems and compare the results. The accuracy of the statistical method is reasonably good, comparable to taggers for English. But the constraint-based tagger seems to be superior even with the limited time we allowed ourselves for rule development.

1 Overview

In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. The process of tagging consists of three stages: tokenisation, morphological analysis and disambiguation. The two taggers include the same tokeniser and morphological analyser. The tokeniser uses a finite-state transducer that reads the input and outputs a token whenever it has read far enough to be sure that a token is detected.

The morphological analyser contains a transducer lexicon. It produces all the legitimate tags for words that appear in the lexicon. If a word is not in the lexicon, a guesser is consulted. The guesser employs another finite-state transducer. It reads a token and prints out a set of tags depending on prefixes, inflectional information and productive endings that it finds.

We make even more use of transducers in the constraint-based tagger. The tagger reads one sentence at a time, a string of words and alternative tags, feeds them to the grammatical transducers that remove all but one alternative tag from all the words on the basis of contextual information.
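The analysis and constraint stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper uses finite-state transducers, whereas this sketch uses a plain dictionary and ordinary functions. The lexicon entries, the tag names (`DET`, `PRON`, `NOUN`, `VERB`, `ADV`), the ending-based guesser rules and the two constraints are all hypothetical and chosen only for illustration.

```python
# Toy lexicon (hypothetical entries and tag names, not the paper's tagset).
LEXICON = {
    "le": {"DET", "PRON"},
    "la": {"DET", "PRON"},
    "chat": {"NOUN"},
    "porte": {"NOUN", "VERB"},
    "dort": {"VERB"},
}


def guess(token):
    """Stand-in for the guesser: propose tags from the word ending."""
    if token.endswith("ment"):
        return {"ADV", "NOUN"}
    if token.endswith("er"):
        return {"VERB"}
    return {"NOUN"}


def analyse(tokens):
    """Attach all candidate tags; consult the guesser for unknown words."""
    readings = []
    for token in tokens:
        key = token.lower()
        tags = LEXICON[key] if key in LEXICON else guess(key)
        readings.append((token, set(tags)))
    return readings


def not_pron_before_noun(sentence, i):
    """Discard PRON when the next word can only be a NOUN."""
    if i + 1 < len(sentence) and sentence[i + 1][1] == {"NOUN"}:
        return {"PRON"}
    return set()


def not_verb_after_det(sentence, i):
    """Discard VERB when the previous word can only be a DET."""
    if i > 0 and sentence[i - 1][1] == {"DET"}:
        return {"VERB"}
    return set()


CONSTRAINTS = [not_pron_before_noun, not_verb_after_det]


def disambiguate(sentence):
    """Apply constraints until nothing changes; never remove the last tag."""
    changed = True
    while changed:
        changed = False
        for i, (_, tags) in enumerate(sentence):
            for rule in CONSTRAINTS:
                remaining = tags - rule(sentence, i)
                if remaining and remaining != tags:
                    tags.intersection_update(remaining)
                    changed = True
    return sentence


tagged = disambiguate(analyse(["Le", "chat", "dort", "rapidement"]))
for word, tags in tagged:
    print(word, sorted(tags))
```

In this toy run the first constraint reduces "Le" to a single tag, while the guessed word "rapidement" keeps both of its candidate tags because no constraint applies to it. A real constraint system would need enough rules to leave one tag per word, as the paper describes.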
If all the transducers described above (tokeniser, morphological analyser and disambiguator) could be composed together, we would get one single transducer that transforms a raw input text to a fully disambiguated output.

The statistical method contains the same tokeniser and morphological analyser. The disambiguation method is a conventional one: a hidden Markov model.

2 Morphological analysis and guessing

2.1 Morphological analyser

The morphological analyser is based on a lexical transducer (Karttunen et al., 1992; Karttunen, 1994). The source lexicon and rules are represented as in the two-level model (Koskenniemi, 1983). They are compiled into a single finite-state transducer using Xerox lexical tools (Karttunen and Beesley, 1992; Karttunen, 1993). The transducer maps each inflected surface form of a word …
Similar resources
Tagging French - comparing a statistical and a constraint-based method
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems and...
Creating a tagset, lexicon and guesser for a French tagger
We earlier described two taggers for French, a statistical one and a constraint-based one. The two taggers have the same tokeniser and morphological analyser. In this paper, we describe aspects of this work concerned with the definition of the tagset, the building of the lexicon, derived from an existing two-level morphological analyser, and the definition of a lexical transducer for guessing u...
Training and Evaluation of POS Taggers on the French MULTITAG Corpus
The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving an important focus of attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluatio...
comparing a statistical and a constraint-based method
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems an...
Transformation-Based Learning of Rules for Constraint Grammar Tagging
If we conceive of a Constraint Grammar as an ordered sequence of transformation rules of a particular kind – as reduction rules rather than replacement rules – the transformation-based learning method used to train Brill taggers can, with minor modifications, be used to train Constraint Grammar taggers as well. This paper makes a few observations based on this approach, and presents some initia...
Publication date: 1994